Multiple camera fusion based on DSmT for 
tracking objects on ground plane 

Esteban Garcia and Leopoldo Altamirano 
National Institute for Astrophysics, Optics and Electronics 
Puebla, Mexico 

eomargr@inaoep.mx, robles@inaoep.mx



Abstract: This paper presents comparative results of a
model for multiple camera fusion based on the
Dezert-Smarandache theory of evidence. Our architecture works at
the decision level to track objects on a ground plane using
predefined zones, producing useful information for surveillance
tasks such as behavior recognition. Decisions from cameras are
generated by applying a perspective-based basic belief assignment
function, which represents the uncertainty derived from camera
perspective while tracking objects on the ground plane. Results
obtained by applying our tracking model to CGI animated
simulations and real sequences are compared with those obtained
by Bayesian fusion, and show how the DSm theory of evidence
outperforms Bayesian fusion for this application.

Keywords: Multiple Cameras Fusion, Tracking,
Dezert-Smarandache Theory, Decision Level Fusion.

I. Introduction 

Computer vision uses information from more than one
camera to perform several tasks, such as 3D reconstruction
or complementing fields of view to increase surveillance
areas, among others. Using more than one camera has some
advantages, even if information is not fused. A simple instance
is a multi-camera system that covers a wider area and, at the
same time, is more robust to failures where the cameras' fields
of view overlap.

There is a tendency in computer vision to work on high
level tasks [1]-[4], where the position of moving objects is not
useful when it is given in image plane coordinates; instead, it
is preferred that position be described according to predefined
regions on the ground plane. This sort of information can be used
for behavior recognition, where people's behavior is described
by means of predefined zones of interest in the scene.

In [4] a tracking system using predefined regions is used
to analyze behavioral patterns. In that work, only one
camera is used and distortions due to camera perspective are
not considered. In [3] a Hierarchical Hidden Markov Model is
used to identify activities, based on tracking people in a room
divided into cells. Two static cameras cover the scene, but the
information coming from them is used separately: their purpose
is to focus on different zones, not to refine information.

As cameras work by transforming information from 3D
space into 2D space, there is always uncertainty involved. In
order to estimate an object's position on the ground plane, it
is necessary to find its position in the image plane and then
estimate that position on the ground plane. For surveillance
tasks where object positions have to be given with respect to
the ground plane, it is possible to apply a projective transform
to estimate them; however, this process might carry errors from
perspective.

In [5] we presented a decision level architecture to fuse
information from cameras, reducing the uncertainty derived from
camera perspective. The stage of the processing at which
data integration takes place allows an interpretation of
information that better describes the position of the objects
being observed and, at the same time, is useful for high level
surveillance systems. In our proposal, individual decisions
are taken by means of an axis-projection-based generalized
basic belief assignment (gbba) function and finally fused using
the Dezert-Smarandache (DSm) hybrid rule. In this work, we
present a theoretical and practical comparison between DSm
and a Bayesian module applied to CGI and real multi-camera
sequences.

This paper is organized as follows: in section II, the
Dezert-Smarandache theory is briefly described as the
mathematical framework. In section III, our architecture is
described together with the gbba function we used. A comparison
between Bayesian fusion and the DSm hybrid combination rule is
presented in section IV. Finally, conclusions are presented in
section V.

II. DSm hybrid model 

The DSmT defines two mathematical models used to represent
and combine information [6]: free and hybrid.

The free DSm model, denoted as $\mathcal{M}^f(\Theta)$, defines
$\Theta = \{\theta_1, \ldots, \theta_n\}$ as a set or frame of $n$
non-exclusive elements and a hyper-power set $D^\Theta$ as the set
of all composite possibilities obtained from $\Theta$ in the
following way:

1) $\emptyset, \theta_1, \ldots, \theta_n \in D^\Theta$

2) $\forall A \in D^\Theta, B \in D^\Theta$: $(A \cup B) \in D^\Theta$, $(A \cap B) \in D^\Theta$

3) $D^\Theta$ is formed only by elements obtained by rules 1 or 2

The function $m(A)$ is called the general basic belief assignment
or mass for $A$, defined as $m(\cdot) : D^\Theta \to [0, 1]$, and is
associated with a source of evidence.

Figure 1. Example of vertical axes obtained by two cameras, projected on
the ground plane

A DSm hybrid model introduces some integrity constraints on
elements $A \in D^\Theta$ when there are known facts related to
those elements in the problem under consideration. In our work,
exclusivity constraints are used to represent those regions on
the ground plane which are not adjacent. The restricted elements
are forced to be empty in the hybrid model
$\mathcal{M}(\Theta) \neq \mathcal{M}^f(\Theta)$, and their mass is
transferred to the non-restricted elements. When the DSm hybrid
model is used, the combination rule for two or more sources is
defined for $A \in D^\Theta$ through a set of functions that rely
on the following notation.

$\phi(A)$ is called the characteristic emptiness function
of a set $A$ ($\phi(A) = 1$ if $A \notin \boldsymbol{\emptyset}$ and
$\phi(A) = 0$ otherwise), where
$\boldsymbol{\emptyset} = \{\emptyset_{\mathcal{M}}, \emptyset\}$ and
$\emptyset_{\mathcal{M}}$ is the set of all elements of $D^\Theta$
forced to be empty. $U$ is defined as
$U = u(X_1) \cup u(X_2) \cup \ldots \cup u(X_k)$, where $u(X)$ is the
union of all singletons $\theta_i \in X$, while
$I_t = \theta_1 \cup \theta_2 \cup \ldots \cup \theta_n$.
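
For reference, the hybrid DSm rule for $k \geq 2$ sources, as it is
usually stated in the DSmT literature, can be written with the
notation defined above. The names $S_1$, $S_2$ and $S_3$ follow that
literature and are assumed here to correspond to the functions
mentioned in the text.

$$m_{\mathcal{M}(\Theta)}(A) = \phi(A)\,[S_1(A) + S_2(A) + S_3(A)]$$

$$S_1(A) = \sum_{\substack{X_1, \ldots, X_k \in D^\Theta \\ X_1 \cap \ldots \cap X_k = A}} \prod_{i=1}^{k} m_i(X_i)$$

$$S_2(A) = \sum_{\substack{X_1, \ldots, X_k \in \boldsymbol{\emptyset} \\ [U = A] \vee [(U \in \boldsymbol{\emptyset}) \wedge (A = I_t)]}} \prod_{i=1}^{k} m_i(X_i)$$

$$S_3(A) = \sum_{\substack{X_1, \ldots, X_k \in D^\Theta \\ X_1 \cup \ldots \cup X_k = A \\ X_1 \cap \ldots \cap X_k \in \boldsymbol{\emptyset}}} \prod_{i=1}^{k} m_i(X_i)$$

$S_1$ is the conjunctive consensus of the free model, $S_2$ moves the
mass of combinations of empty sets to relative or total ignorance,
and $S_3$ moves the mass of relatively empty intersections to the
union of the sets involved.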

III. Multiple cameras fusion 

In order to have a common spatial reference system, spatial
alignment is required. Homography is used to relate
information from cameras. It is possible to recover the
homography from a set of static points on the ground plane [7]
or from dynamic information in the scene [8]. Correspondence
between objects detected in cameras might be achieved by feature
matching techniques [9] or geometric ones [10], [11].

Once the homography matrix has been calculated, it is
possible to relate information from one camera to the others.
While an object is being tracked by a camera, its vertical axis
is obtained and its length is estimated as
$\lambda = l \cos(\alpha)$, where $l$ is the maximum length of the
axis when projected on the ground plane and $\alpha$ is the angle
of the camera with respect to the ground plane.
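
The two geometric steps described above can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions: the homography is taken as an already estimated 3x3
matrix stored as nested lists, the camera angle is given in radians,
and the function names `apply_homography` and
`projected_axis_length` are hypothetical, not part of the original
system.

```python
import math


def apply_homography(h, point):
    """Map an image point (x, y) to ground-plane coordinates.

    h is a 3x3 homography given as nested lists in row-major order.
    """
    x, y = point
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get Cartesian coordinates
    return (u / w, v / w)


def projected_axis_length(max_length, camera_angle):
    """Projected length of the vertical axis: max_length * cos(angle).

    camera_angle is in radians, measured with respect to the ground plane.
    """
    return max_length * math.cos(camera_angle)
```

With a camera directly above the object (an angle of 90 degrees) the
projected length tends to zero, which matches the idea that an
overhead view carries no perspective uncertainty.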



Let $\Gamma = \{\gamma_1, \ldots, \gamma_n\}$ denote the ground plane
partition, where each $\gamma_x$ is a predefined region on the
ground plane, which might be a zone of special interest, such as
a corridor or parking area.

For each moving object $i$, a frame
$\Theta_i = \{\theta_1, \ldots, \theta_k\}$ is created. Each element
$\theta_x$ represents a zone $\gamma_y$ where the object $i$ might be
located, according to information from the cameras. $\Theta_i$ is
built dynamically, considering only the zones for which some
belief is provided by at least one camera.

Multiple camera fusion, in the way it is used in this work, is
a tool for high level surveillance systems. Behavior recognition
models, such as fuzzy logic classifiers or probabilistic models,
might use information in the form of beliefs. Therefore, a camera
is allowed to assign mass to elements of $D^\Theta$ of the form
$\theta_i \cap \theta_j$, because this might represent an object
on the border of two regions of the ground plane. For pairs of
hypotheses which represent non-adjacent regions of the ground
plane, such belief assignments do not make sense; therefore,
elements of $D^\Theta$ representing intersections of non-adjacent
regions are included in $\emptyset_{\mathcal{M}}$.
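
The exclusivity constraints can be derived mechanically from the
zone layout. The sketch below assumes a regular grid of zones
numbered row by row, starting at zero, and treats two zones as
adjacent when they share an edge or a corner. Both the numbering and
this adjacency criterion are assumptions made for illustration, as
are the function names.

```python
from itertools import combinations


def are_adjacent(zone_a, zone_b, cols):
    """True when two distinct zones of a row-major grid share an edge or a corner."""
    row_a, col_a = divmod(zone_a, cols)
    row_b, col_b = divmod(zone_b, cols)
    return max(abs(row_a - row_b), abs(col_a - col_b)) == 1


def forced_empty_pairs(candidate_zones, cols):
    """Pairs of candidate zones whose intersection is constrained to be empty."""
    return {
        frozenset(pair)
        for pair in combinations(sorted(candidate_zones), 2)
        if not are_adjacent(pair[0], pair[1], cols)
    }
```

For a frame built from zones 0, 1 and 2 of a grid with four columns,
only the pair (0, 2) is returned, so mass may still be assigned to
the intersections of zones 0 and 1 and of zones 1 and 2.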

Each camera behaves as an expert, assigning mass to each
of the unconstrained elements of $D^\Theta$. The assignment
function is simple, and its main purpose is to account for the
influence of perspective on uncertainty. This is achieved by
measuring the intersection area between $\gamma_x$ and the object’s
vertical axis projected on the ground plane, centered on the
object’s feet. The length of the axis projected on the ground plane
is determined by the angle of the camera with respect to the
ground plane, taking the object’s ground point as the vertex to
measure the angle. So, if the camera were directly above the
object, its axis projection would be just one pixel long, meaning
no uncertainty at all. We consider three cases to cover mass
assignment, shown in figure 2.

When the projected axis is within a region of the ground plane,
the camera assigns full belief to that hypothesis. When the axis
crosses two regions, it is possible to assign mass to composed
hypotheses of the kind $\theta_i \cup \theta_j$ and
$\theta_i \cap \theta_j$, depending on the angle of the camera.

Figure 2. Cases considered for belief assignment: (a) belief is
assigned to $\theta_i$; (b) belief is assigned to $\theta_i$,
$\theta_j$, $\theta_i \cup \theta_j$ and $\theta_i \cap \theta_j$;
(c) belief is assigned to $\theta_i, \ldots, \theta_k$ and
$\theta_i \cup \ldots \cup \theta_k$

Figure 3. Bayesian classifiers as fusion module

Let $\omega_c$ denote the vertical axis obtained by camera $c$,
projected on the ground plane, and $|\omega_c|$ its area. The gbba
model is defined by functions of these quantities, with a separate
set of functions for the case in which the axis intersects more
than two regions of the ground plane. The term $v + |\omega_c|$ is
used as a normalizer in order to satisfy
$m_c(\cdot) \to [0, 1]$. Each camera can provide belief to
elements $\theta_x \cap \theta_y \in D^\Theta$ by considering pairs
$\gamma_i$ and $\gamma_j$ (represented by $\theta_x$ and $\theta_y$
respectively) crossed by the axis projection. Elements
$\theta_x \cup \ldots \cup \theta_y$ can have an associated gbba
value, which represents local or global ignorance. We also
restrict elements $\theta_x \cap \ldots \cap \theta_y \in D^\Theta$
for which there is no direct basic assignment made by any of the
cameras; they are thus included in $\emptyset_{\mathcal{M}}$, and
calculations are simplified. This is possible because of the
hybrid DSm model definition.

Decision fusion is used to combine the outcomes from the
cameras: we apply the hybrid DSm rule of combination over
$D^\Theta$ in order to reach a final decision.
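
A minimal sketch of this fusion step for two cameras is given below.
It is not the implementation used in the experiments. It assumes that
every element of the hyper-power set is encoded as a frozenset of the
Venn-diagram parts that remain non-empty in the hybrid model, so that
set intersection and union mirror the DSm operators and an empty
frozenset means "forced to be empty". It also assumes, as stated
above, that cameras only give mass to unconstrained elements, so only
the conjunctive term and the transfer to the union are needed. The
zone layout and the mass values are hypothetical.

```python
from itertools import product


def combine_hybrid_dsm(m1, m2):
    """Fuse two mass functions under a hybrid DSm model.

    Focal elements are frozensets of Venn-diagram parts that are
    non-empty in the model. Neither source may give mass to an
    empty element, so the term for empty inputs never applies.
    """
    fused = {}
    for (x1, w1), (x2, w2) in product(m1.items(), m2.items()):
        if not x1 or not x2:
            raise ValueError("sources must assign mass to non-empty elements")
        meet = x1 & x2
        # Non-empty intersection keeps the mass; otherwise it goes to the union
        target = meet if meet else x1 | x2
        fused[target] = fused.get(target, 0.0) + w1 * w2
    return fused


# Hypothetical layout: three zones in a row, so zones 1 and 3 are not
# adjacent and no part shared by both exists in the model.
theta1 = frozenset({"1", "12"})
theta2 = frozenset({"2", "12", "23"})
theta3 = frozenset({"3", "23"})

camera1 = {theta1: 0.7, theta1 & theta2: 0.3}
camera2 = {theta2: 0.6, theta3: 0.4}
fused = combine_hybrid_dsm(camera1, camera2)
```

In this example the conflicting pair (zone 1 from one camera, zone 3
from the other) does not vanish: its mass is kept as partial
ignorance on the union of both zones, and the fused masses still sum
to one.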



IV. Results and discussion 

To test the proposed fusion architecture, we used
computer-generated imagery (CGI) sequences (figure 4) and real
sequences from the Performance Evaluation of Tracking and
Surveillance (PETS) dataset [12].

In the CGI sequences, three cameras were simulated. We
considered a square scenario with a grid of sixteen regular
predefined zones. 3D modeling was done using Blender with
Yafray as the rendering engine. All generated images are at
a resolution of 800x600 pixels. Examples of rendered images
are shown in figure 4, where division lines were outlined on
the ground plane as a visual reference of the zones; they are
not required for any other task.

For real sequences, the PETS repository was used (figure 5).
This data set provides information from two cameras, at a
resolution of 768x576 pixels in JPEG format. Our architecture
and gbba function were applied to track people, cars and
bicycles.

As part of the results, it is interesting to show the differences
between DSm and a probabilistic model for fusing decisions. For
this application, hypotheses have a geometric meaning, and we
found that this has to be taken into consideration during fusion.



A. Probabilistic fusion module 

For comparison purposes, a Bayesian classifier was developed
for each of the regions of the ground plane, as shown in
figure 3. The a priori probability is assumed to be the same for
each of the regions, while the conditional probability is taken
from the masses generated by the cameras, after normalization.

Figure 4. Example of CGI sequences: (a) Camera 1, (b) Camera 2, (c) Camera 3

Table I
Results on CGI animations

Source           TRDR     FAR      Similarity to Truth
Camera 1         99.5%    52.9%    65.2%
Camera 2         93.9%    43.0%    69.7%
Camera 3         84.4%    45.3%    23.0%
DSm              93.9%     5.6%    84.1%
Probabilistic    93.3%     5.2%    24.9%

Figure 5. Example of real sequences from PETS: (a) Camera 1, (b) Camera 2, (c) Ground plane

Table II
Results on real sequences

Source           TRDR     FAR      Similarity to Truth
Camera 1         68.1%    21.7%    31.6%
Camera 2         71.0%     2.7%    67.5%
DSm              82.8%    10.2%    75.9%
Probabilistic    82.8%    10.2%    67.9%

Ignorance from a camera means that the camera does not
have a good point of view to generate its information. If a
probabilistic model is applied, ignorance is not considered, and
that might lead to wrong results. Consider the following
numerical example, in which two cameras assign the following
beliefs:

$m_1(A) = 0.35 \quad m_1(B) = 0.6 \quad m_1(A \cup B) = 0.05$
$m_2(A) = 0.3 \quad m_2(B) = 0.1 \quad m_2(A \cup B) = 0.6$

The probabilistic model generates the following decisions:

$P(A) \propto 0.5 \cdot \frac{0.35}{0.35 + 0.6} \cdot \frac{0.3}{0.3 + 0.1} \approx 0.138$
$P(B) \propto 0.5 \cdot \frac{0.6}{0.35 + 0.6} \cdot \frac{0.1}{0.3 + 0.1} \approx 0.079$

DSm model results:

$m_{DSm}(A) = 0.35 \cdot 0.3 + 0.35 \cdot 0.6 + 0.05 \cdot 0.3 = 0.33$
$m_{DSm}(B) = 0.6 \cdot 0.1 + 0.6 \cdot 0.6 + 0.05 \cdot 0.1 = 0.425$
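
The arithmetic of this example can be checked with a short script.
The sketch assumes a uniform prior of 0.5 and a renormalization of
each camera's masses over the singleton hypotheses for the
probabilistic module, as the numbers of the example suggest; the
function names and the dictionary key `"AuB"` for the union are
hypothetical.

```python
def bayesian_scores(camera_masses, hypotheses, prior):
    """Unnormalized score per hypothesis, ignoring the mass given to unions."""
    scores = {}
    for h in hypotheses:
        score = prior
        for masses in camera_masses:
            # Renormalize over singleton hypotheses only: ignorance is dropped
            singleton_total = sum(masses[x] for x in hypotheses)
            score *= masses[h] / singleton_total
        scores[h] = score
    return scores


def dsm_singleton_masses(m1, m2, hypotheses, ignorance):
    """Fused mass of each singleton when both sources use A, B and their union."""
    return {
        h: m1[h] * m2[h] + m1[h] * m2[ignorance] + m1[ignorance] * m2[h]
        for h in hypotheses
    }


m1 = {"A": 0.35, "B": 0.6, "AuB": 0.05}
m2 = {"A": 0.3, "B": 0.1, "AuB": 0.6}

bayes = bayesian_scores([m1, m2], ["A", "B"], prior=0.5)
dsm = dsm_singleton_masses(m1, m2, ["A", "B"], "AuB")

print({h: round(v, 3) for h, v in bayes.items()})
print({h: round(v, 3) for h, v in dsm.items()})
```

The probabilistic scores rank A above B (about 0.138 against 0.079),
while the DSm masses rank B above A (0.425 against 0.33), because the
products that involve the mass on the union keep the ignorance of
each camera in play.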



In the decisions generated by the cameras, the first sensor
assigns higher mass to hypothesis B, while the second sensor
assigns higher belief to hypothesis A. If ignorance is
considered, it is clear that the fusion result must give a higher
value to hypothesis B, because the first sensor is in a better
position: it commits only 0.05 of its mass to ignorance, whereas
the second sensor commits 0.6. However, in the probabilistic
fusion decision, hypothesis A is higher. This shows how
considering ignorance may improve the results of fusion applied
to multi-camera tracking.

Positions obtained by fusing the decisions of the cameras
are shown in figures 6 and 7. The plots show how DSm gets
higher decision values than Bayesian fusion.

Tables I and II show the metrics TRDR (Tracker Detection Rate)
and FAR (False Alarm Rate), computed from data collected
from 2 CGI sequences and 5 real sequences. We also propose
a Similarity to Truth measure to evaluate how close the values
resulting from fusion are to the ground truth data.

TRDR and FAR are evaluated with the following equations:

$$\mathrm{TRDR} = \frac{TP}{TG} \qquad (12)$$

$$\mathrm{FAR} = \frac{FP}{TP + FP} \qquad (13)$$

where $TG$ is the total number of regions, per image, in which
there are objects in motion according to the ground truth.
According to these metrics, it is desirable to have the highest
value of TRDR and the lowest value of FAR.
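
Both metrics are simple ratios. The sketch below assumes that TP and
FP are counts of true and false positives accumulated over a
sequence, which the text does not define explicitly; the function
names are hypothetical.

```python
def tracker_detection_rate(true_positives, total_ground_truth):
    """TRDR = TP / TG."""
    if total_ground_truth <= 0:
        raise ValueError("TG must be positive")
    return true_positives / total_ground_truth


def false_alarm_rate(true_positives, false_positives):
    """FAR = FP / (TP + FP)."""
    detections = true_positives + false_positives
    if detections <= 0:
        raise ValueError("there must be at least one detection")
    return false_positives / detections
```

For instance, 90 true positives out of 100 ground-truth regions with
10 false positives give a TRDR of 0.9 and a FAR of 0.1.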

Similarity to Truth is a measure that quantifies the differences
between the positions obtained by the fusion modules and the
ground truth. When belief is assigned to a certain position and
an object also exists at that position in the ground truth, the
amount of belief is added; when there is no object at that
position in the ground truth, the amount of belief is
subtracted. Finally, the total is normalized to be shown as a
percentage.
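
One possible reading of this measure is sketched below. The text does
not state the normalization, so dividing by the number of occupied
zones in the ground truth is an assumption, as are the data layout
(one dictionary of zone beliefs and one set of occupied zones per
frame) and the function name.

```python
def similarity_to_truth(beliefs_per_frame, truth_per_frame):
    """Signed sum of beliefs, as a percentage of the ground-truth zone count.

    beliefs_per_frame: list of dicts mapping zone -> belief for each frame.
    truth_per_frame: list of sets with the zones occupied in each frame.
    """
    score = 0.0
    occupied = 0
    for beliefs, truth in zip(beliefs_per_frame, truth_per_frame):
        occupied += len(truth)
        for zone, belief in beliefs.items():
            # Belief on an occupied zone counts in favor, otherwise against
            score += belief if zone in truth else -belief
    if occupied == 0:
        raise ValueError("ground truth contains no occupied zones")
    return 100.0 * score / occupied
```

Under this reading, a single frame with belief 0.8 on the occupied
zone and 0.2 on an empty zone scores 60%, and the measure can become
negative when most of the belief falls on empty zones.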

The results in the tables show how DSm reduces the uncertainty
due to perspective and complements information where cameras
lose the object or fields of view do not overlap. Bayesian fusion
behaves similarly to DSm in terms of TRDR and FAR; however, the
hybrid combination rule takes into consideration the information
assigned to ignorance, which may refine the result, as in the
example from section IV-A. In Similarity to Truth (ST), which
quantifies how close the belief assigned to regions is to the
ground truth, DSm has higher values, closer to the ground truth.

V. Conclusions 

Using cameras as experts at a high level for processing
object positions allows the Dezert-Smarandache Theory to be
applied to combine beliefs. Beliefs correspond to object
locations on the ground plane, given in relation to predefined
regions.

Tests showed that the DSm theory of evidence generates higher
decision values and a better approximation to the ground truth.
In addition, DSmT allows belief to be assigned to
intersections of hypotheses, which might be interpreted as an
object on the border of two regions and might be useful
information for behavior recognition based on fuzzy logic,
while probabilistic approaches do not allow this kind of
information because of exclusivity constraints. For the fusion
of object positions, DSmT showed better results than Bayesian
fusion.

Even though good results were obtained using DSmH, it is known
that when conflicting sources are combined, the masses
committed to partial ignorances increase, and after a while this
ends up in the vacuous belief assignment. It is expected
that the DSm-PCR5 fusion rule would yield better results.